Boruta - A System for Feature Selection

Authors

  • Miron B. Kursa
  • Aleksander Jankowski
  • Witold R. Rudnicki
Abstract

Machine learning methods are often used to classify objects described by hundreds of attributes; in many applications of this kind a large fraction of the attributes may be totally irrelevant to the classification problem. Moreover, one usually cannot decide a priori which attributes are relevant. In this paper we present an improved version of the algorithm for identifying the full set of truly important variables in an information system. It is an extension of the random forest method which utilises the importance measure generated by the original algorithm. It iteratively compares the importances of the original attributes with the importances of their randomised copies. We analyse the performance of the algorithm on several examples of synthetic data, as well as on a biologically important problem, namely the identification of sequence motifs that are important for the aptameric activity of short RNA sequences.
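The comparison of original attributes with their randomised copies can be illustrated with a small sketch. This is not the authors' implementation: to stay dependency-free it assumes the absolute Pearson correlation as a stand-in for the random forest importance measure, and the names `abs_correlation`, `count_hits`, `n_iter` and the toy data are hypothetical. In each iteration every attribute is shuffled to produce a randomised copy, and an original attribute scores a hit when its importance exceeds the best importance among the copies.

```python
import random
import statistics


def abs_correlation(xs, ys):
    """Stand-in importance: |Pearson correlation| (an assumption,
    NOT the random forest importance used by the paper)."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return abs(sxy) / (sxx * syy) ** 0.5


def count_hits(columns, target, n_iter=50, seed=0):
    """Count, per attribute, how often it beats the best randomised copy."""
    rng = random.Random(seed)
    hits = {name: 0 for name in columns}
    for _ in range(n_iter):
        # Randomised copies: shuffle each column to break its link to the target.
        shadow_scores = []
        for values in columns.values():
            shuffled = values[:]
            rng.shuffle(shuffled)
            shadow_scores.append(abs_correlation(shuffled, target))
        threshold = max(shadow_scores)
        # An original attribute scores a hit if it beats every randomised copy.
        for name, values in columns.items():
            if abs_correlation(values, target) > threshold:
                hits[name] += 1
    return hits


# Toy data (hypothetical): one informative attribute, one pure-noise attribute.
rng = random.Random(1)
n = 200
signal = [rng.gauss(0, 1) for _ in range(n)]
noise = [rng.gauss(0, 1) for _ in range(n)]
target = [1.0 if s + 0.3 * rng.gauss(0, 1) > 0 else 0.0 for s in signal]

hits = count_hits({"signal": signal, "noise": noise}, target)
print(hits)
```

In this sketch the informative attribute should collect a hit in every iteration, while the noise attribute wins only occasionally. A full procedure would additionally need a rule for deciding, from the hit counts, which attributes to keep or discard.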

Similar resources

Feature Selection with the Boruta Package

This article describes an R package, Boruta, implementing a novel feature selection algorithm for finding all relevant variables. The algorithm is designed as a wrapper around a Random Forest classification algorithm. It iteratively removes the features which are proved by a statistical test to be less relevant than random probes. The Boruta package provides a convenient interface to the algorith...

bootfs - Bootstrapped feature selection

The usage of the package is illustrated for three classification algorithms: pamr (Prediction Analysis for Microarrays, [3], implementation in the pamr R package), rf boruta (Random forests with the Boruta algorithm for feature selection, [2], implementation in the Boruta R package) and scad (Support Vector Machines with Smoothly Clipped Absolute Deviation feature selection, [4], implementation in the ...

Evaluation of variable selection methods for random forests and omics data sets.

Machine learning methods and in particular random forests are promising approaches for prediction based on high dimensional omics data sets. They provide variable importance measures to rank predictors according to their predictive power. If building a prediction model is the main goal of a study, often a minimal set of variables with good prediction performance is selected. However, if the obj...

A Hybrid Random Forests-Boruta Feature Selection Algorithm for Biodegradability Prediction

The a priori knowledge about biodegradability is adopted to save time and money for research and design of new products. Quantitative structure activity relationship (QSAR) models as a tool for biodegradability prediction of chemicals have been encouraged by environmental organizations. In the current work, a new algorithm has been proposed to investigate the importance of chemical descriptors ...

Feature Selection and Predictive Modeling of Housing Data Using Random Forest

Predictive data analysis and modeling involving machine learning techniques become challenging in the presence of too many explanatory variables or features. Too many features are known not only to slow algorithms down but also to decrease model prediction accuracy. This study involves a housing dataset with 79 quantitative and qualitative fea...

Optimal Feature Selection for Data Classification and Clustering: Techniques and Guidelines

In this paper, principles and existing feature selection methods for classifying and clustering data are introduced. To that end, categorizing frameworks for finding selected subsets, namely search-based and non-search-based procedures, as well as evaluation criteria and data mining tasks, are discussed. Subsequently, a platform is developed as an intermediate step toward developing an intell...


Journal:
  • Fundam. Inform.

Volume 101

Publication date: 2010